Rare variants analyses suggest novel cleft genes in the African population

Non-syndromic orofacial clefts (NSOFCs) are common birth defects with a complex etiology. While over 60 common risk loci have been identified, they explain only a small proportion of the heritability for NSOFCs. Rare variants have been implicated in the missing heritability. Thus, our study aimed to identify genes enriched with nonsynonymous rare coding variants associated with NSOFCs. Our sample included 814 non-syndromic cleft lip with or without palate (NSCL/P), 205 non-syndromic cleft palate only (NSCPO), and 2150 unrelated control children from Nigeria, Ghana, and Ethiopia. We conducted a gene-based analysis separately for each phenotype using three rare-variants collapsing models: (1) protein-altering (PA), (2) missense variants only (MO); and (3) loss of function variants only (LOFO). Subsequently, we utilized relevant transcriptomics data to evaluate associated gene expression and examined their mutation constraint using the gnomeAD database. In total, 13 genes showed suggestive associations (p = E−04). Among them, eight genes (ABCB1, ALKBH8, CENPF, CSAD, EXPH5, PDZD8, SLC16A9, and TTC28) were consistently expressed in relevant mouse and human craniofacial tissues during the formation of the face, and three genes (ABCB1, TTC28, and PDZD8) showed statistically significant mutation constraint. These findings underscore the role of rare variants in identifying candidate genes for NSOFCs.

among others 3 .In the United States, the annual cost of hospital stays for children with OFCs was over 400 million in 2013 4 .Moreover, the team of experts required for OFC care is often unavailable in resource-limited settings, leading to significant inequalities in cleft management 5 .
Developing preventive or improved therapeutic strategies for NSOFCs requires a thorough understanding of their etiology.As with many complex traits, the etiology of NSOFCs is multifactorial, with genetic factors playing a considerable role 6 .According to a recent review, over 60 (> 40 associated with NSCL/P) risk loci have been implicated mainly through common variant association studies 7 .However, all identified loci/genes are estimated to explain only a small fraction (~ 25% for NSCL/P and even less for NSCPO) of the estimated heritability 8 .Low frequency/rare variants, gene-gene interactions, and gene-environmental interaction effects may likely explain the missing heritability 9 .
The role of rare variants in complex traits is well documented 10 , and a recent study found rare variants to be responsible for a larger proportion of the missing heritability in complex traits etiology 11 .However, studies evaluating the role of rare variants for NSOFCs have been restricted largely to the resequencing of known cleft candidate genes, limiting the discovery of novel genes.Although resequencing has provided insights into the burden of rare variants in these candidate genes, the small effect sizes often estimated for complex traits, together with the low MAF of rare variants (minor allele frequency [MAF] ≤ 0.01) makes the conventional single variants association analysis underpowered [12][13][14] .A more efficient approach to rare variant analysis is to select a fixed MAF threshold and conduct an aggregate test on all variants with a MAF below this threshold within a specified region (e.g.gene), either by assigning equal weight to all variants identified or by varying weights based on the estimated variance of each variant under the null hypothesis of no association 13 .This approach has been applied successfully in discovering genes associated with complex traits like blood pressure, myocardial infarction, and schizophrenia [15][16][17] .
Attempts at leveraging rare variant aggregation to identify novel genes for NSOFCs are relatively new and have been limited to samples of individuals of European ancestry 8,18,19 .Additionally, these studies included all the identified rare variants within a gene, an approach that has been shown to be less sensitive due to the inclusion of synonymous variants, which often do not contribute to disease etiology 20 .To address these limitations in previous studies, we conducted gene-based rare variant aggregation tests, including only the protein-altering rare variants using our African genome-wide association study (GWAS) data.We hypothesized that genes enriched for rare protein-altering variants associated with NSOFCs would contribute to the etiology of NSOFCs.Further, we utilized transcriptomics data to provide additional evidence for the associated genes.The African population, the ancestral origin of modern humans, harbors the greatest number of genetic variations and, thus, provides a considerable opportunity for genetic discoveries 21 .

Discussion
We conducted gene-based rare variant aggregation tests to identify novel candidate genes that could explain the missing heritability for NSOFCs.In total, we identified 13 genes with suggestive associations primarily driven by rare missense variations.Eight genes (ABCB1, ALKBH8, CENPF, CSAD, EXPH5, PDZD8, SLC16A9, and TTC28) showed consistent expression in relevant mouse and human craniofacial tissues during the formation of the face.Further, three genes (ABCB1, TTC28, and PDZD8) were predicted to be intolerant to missense variations.
www.nature.com/scientificreports/Using biological plausibility to prioritize loci/genes with suggestive association in GWAS has been previously shown as a valid approach to bypassing the large sample size requirement needed to achieve significant association 22 .While gene mutation constraint is a good metric for identifying pathogenic genes, it is more commonly seen with a dominant disease-causing gene and may not be informative in other disease models (e.g., recessive).Thus, we prioritized the 8 associated genes (ABCB1, ALKBH8, CENPF, CSAD, EXPH5, PDZD8, SLC16A9, and TTC28) with consistent expression in relevant craniofacial tissues during human and mouse face development.Majority of these genes (TTC28, CSAD, EXPH5, SLC16A9, ALKBH8, and CENPF) showed biased expression towards mesenchymal cells subtypes from human craniofacial tissues.This could point to the importance of mesenchymal cells in palate formation since these genes were associated with either NSCLP or NSCPO and not NSCLO.Moreover, this could also be due to the stage (CS17) of embryonic development at which the facial prominences were harvested.The CS17 stage (~ 7 weeks post fertilization), a period that coincides with the later stages of lip formation and the beginning of palate formation.Furthermore, the lower prevalence of NSCPO compared to NSCL/P suggests a higher burden of variants threshold in NSCPO compared to NSCL/P.This may explain why we identified more genes enriched for rare variants associated with NSCPO/ NSCLP compared to NSCLO.
Five genes (ABCB1, TTC28, PDZD8, CENPF, and ALKBH8) have been previously implicated in NSOFCs or diseases presenting with cleft phenotypes.The ABCB1 and TTC28 were associated with NSCL/P while the PDZD8, CENPF, ALKBH8 genes are associated with NSCPO.These genes showed consistently high expression and enrichment in mouse craniofacial tissues especially the palate.The ABCB1 gene is an ATB binding cassette www.nature.com/scientificreports/gene that functions to regulate fetal exposure to xenobiotics through the placenta 23 .Single nucleotide variations in the ABCB1 gene have been reported to increase the risk of NSCL/P 23 .Additionally, knockout of the ABCB1 gene in mice increases susceptibility to cleft palate 24 .The TTC28 gene is in the 22q12.2region and previous case report on patients with microdeletion of this region implicate this gene as potential candidate for pierre robin sequence-a condition that presents with cleft palate 25 .Furthermore, copy number variations overlapping this gene have been reported in cleft palate patients 26 .The PDZD8 gene assists with lipid transfer from the endoplasmic reticulum to the endosomes and lysosomes 27,28 .Burden of variations though not statistically significant have been previously reported in this gene among patients with cleft lip with or without palate 29 .CENPF is a kinetochore associated protein that colocalizes with the IFT88 (a ciliopathy gene) and compound heterozygous mutations in the CEPNF gene were reported in a human fetus with ciliopathic malformations, including cleft palate 30 .Mutations in ALKBH8 cause intellectual developmental disorder, autosomal recessive 71, MRT71 (OMIM #618504) in humans.This condition presents with craniofacial dysmorphic features, which include long lips with V-shaped upper lip, macrostomia, and retruded mandible 31 .Macrostomia and retruded mandible may cause cleft palate by impeding the elevation of palatal shelves during palate formation 32 .We identified three potential novel genes (CSAD, EXPH5, SLC16A9) for NSOFCs.The burden of variants in CSAD and EXPH5 were associated with NSCLP while those in the SLC16A9 gene were associated with NSCPO.These genes, except for the EXPH5 gene which showed a suggestive association with NSOFCs in gene-by-sex interaction analysis using the same GWAS data 33 , lack previous reports of direct association with NSOFCs phenotypes.However, they have been implicated in processes crucial for craniofacial development.The EXPH5 gene, for instance, has been shown to play a role in cell-cell adhesion 34 ; a critical process in craniofacial morphogenesis.The SLC16A9 gene is linked to lipid metabolic traits 35 ; a functionally relevant downstream target of the non-canonical transforming growth factor beta (TGFbeta) signaling-a signaling mechanism involved in Color corresponds to the average expression of that gene within cells of that cluster (Low = blue, high = red).(b) Expression analysis in craniofacial structures during human face development.Heatmap of expressed genes across all samples in the developmental series.Z-scores for normalized gene counts ranging from high (yellow) to low (blue) expression.The dendrogram on the left is the hierarchical clustering of all samples.face formation.The CSAD gene functions in the biosynthesis of taurine and the role of taurine in organogenesis has been demonstrated in mice 36 .
In the current study, we did not find associations with previously reported genes from rare variant association studies on NSOFCs in other populations.Similarly, a previous multiethnic gene-based GWAS on NSCL/P showed that different genes were enriched for low-frequency/rare variants associated with NSCL/P across genetically defined ancestral groups (Asians, Europeans, and Latin Americans) without overlap 8 .Additionally, none of the genes showing suggestive or significant associations in this study, as well as the previous study, were known NSOFCs risk genes.We replicated the ABCB1 gene association which was previously reported for NSCL/P in a common variants' association study.While some reports have demonstrated the accumulation of rare variants in genes previously identified through common variants association analyses 37 , suggesting a convergence of common and rare variants in loci/genes associated with NSOFCs phenotypes, other reports suggest that common and rare variants may act through separate loci/genes 29 .For instance, targeted sequencing of regions surrounding www.nature.com/scientificreports/genome-wide significant loci for NSOFCs showed no evidence of rare variants burden in genes/regulatory regions proximal to these loci 29 .Additionally, rare variants are population-dependent, which could explain why we did not replicate previously reported genes/loci from rare variant association studies in other populations.Furthermore, our tests of association require that we exclude any gene with only one rare variant, even if the same gene harbored common variants.This might have resulted in the omission of genes with contributions from both rare and common variants if participants in our cohort only harbor one rare variant in the gene.Therefore, future studies should consider a model that allows for the incorporation of both common and rare variants.
Our study has some limitations.First, we used array-based genotype data for discovery.This approach means some rare variants were not examined.Second, we controlled for population stratification by adding the top 10 genotype PCs as covariates in our gene-based association regression models.Adjusting for top PCs has been shown to prevent p-value inflation and reduce false positive rates in common variant analysis, but its performance for rare variant analysis remains controversial 38 .The reason is that rare variants, being newer mutations, reflect a more granular population substructure compared to common variants 39 .Finally, we restricted our analysis to only the coding (missense and Lof) and splice-altering variants.Rare variants including insertions/deletions in non-coding regulatory regions also contribute to the etiology of NSOFCs; however, defining these regions' analytical units and their functional characterization remains a challenge.Hence, future studies should use WGS data for the gene-based analysis to capture more rare variants and leverage annotated craniofacial enhancers regions to analyze rare variants in non-coding regions.
In summary, we identified 13 genes with suggestive associations in our GWAS data.Among the 13 genes, 3 genes (ABCB1, TTC28 and PDZD8) were predicted to be intolerant to missense variations.Human and mouse transcriptomics data further supported the association of 8 genes.Of the 8 genes, five genes (ABCB1, TTC28, PDZD8, CENPF, ALKBH8) were previously associated with NSOFCs or diseases presenting with cleft phenotypes.The remaining three genes (CSAD, EXPH5, SLC16A9) are potentially novel candidate genes for NSOFCs.

GWAS study participants
The details of the GWAS participants have been previously published 40 .Briefly, NSOFCs case children were recruited during surgical repair at cleft clinics and free surgical missions sponsored by Smile Train in Nigeria, Ghana, and Ethiopia.The cleft surgeons at each participating site used a standardized phenotyping protocol (physical examination and clinical photographs) to confirm NSOFCs status.Additionally, echocardiography was used to rule out the presence of congenital heart defects to ensure nonsyndromic status.Control children were those without a birth defect diagnosis attending immunization/welfare clinics at the same center where the case children were recruited.To be eligible to participate in the study, case and control children must have biological parents of African ancestry who reside in Africa.Our sample included 1019 NSOFCs case children and 2150 unrelated control children.Among the cases, 810 had non syndromic cleft lip with or without palate (NSCL/P)-394 non-syndromic cleft lip only (NSCLO), 420 non-syndromic cleft lip and palate (NSCLP), and 205 had non-syndromic cleft palate only (NSCPO).The distribution of the GWAS study participants by cleft status and country of origin is shown in Table 2.

Data collection, DNA extraction, and genotyping
Demographic information (age, sex, and residential location) and limited exposure information were obtained.Saliva specimens were collected using the Oragene saliva kit, de-identified and shipped to the Butali laboratory at the University of Iowa.DNA was extracted from the saliva using the standard Oragene saliva DNA extraction protocol and quantified using Qubit (http:// www.invit rogen.com/ site/ us/ en/ home/ brands/ Produ ct-Brand/ Qubit.html;Thermo Fisher Scientific, Grand Island, NY).As part of internal QC, Taqman XY genotyping was used for sex confirmation.Subsequently, aliquoted DNA was sent to the Center for Inherited Disease Research (Baltimore, Maryland, USA) for genotyping.The Illumina Multi-Ethnic Genotyping Array MEGA v2 15070954 A2 (genome build 37), which has over 2 million variants including over 60 000 rare variants selected from populations of African origin, was used for the genotyping.Details on genotyping and QC measures have been previously published 40 .www.nature.com/scientificreports/

Data analysis
Gene-based analyses Variant predication was performed using annotate variation (ANNOVAR) to identify the functional consequences (synonymous, nonsynonymous, and splice-altering) of each variant.Splice altering and non-synonymous variants (variations resulting in either amino acid change or premature termination of the protein) were filtered against the 1000 genome (1 kg) population databases (http:// www.1000g enomes.org/) to identify rare variants (found in ≤ 1% of Africans included in the database).Additionally, we filtered for only the variants with a MAF ≤ 1% in the controls included in our sample (Supplementary Table 2).Subsequently, genes with two or more rare non-synonymous variants in the data were filtered to satisfy the aggregation requirement for the proposed analyses (Fig. 3).For the analysis, both non-synonymous and splice-altering variants were included and referred to as "protein-altering".Three gene-based collapsing models were used: (1) protein-altering (PA),  41 .The CMC test, being a burden test, assumes the same direction of effects for all the variants within a gene, whereas SKAT is a variance component test that allows for an opposing direction of effects 41 .The SKAT-O test is an omnibus test that is more robust and efficient across different scenarios 41 .In all three tests, population stratification and sex were controlled for by adding the top 18 principal components (PC) and child sex into the regression-based models.The lack of prior knowledge about the underlying biology of these variants precluded the ability to select one optimal test.Hence, the decision to conduct the three tests where the CMC and SKAT will show rigor and SKAT-O will confirm reproducibility.The Bonferroni correction was used to adjust for multiple testing and set the cut-off for statistical significance at a 5% error rate to 0.05 divided by the number of genes tested and a 10 2 higher threshold for suggestive significance.A gene was considered significant if it showed a statistically significant or suggestive significant association in either CMC and SKATO or SKAT and SKATO.Rare variant aggregation tests were conducted on our GWAS data using the SKAT R package (https:// cran.r-project.org/ web/ packa ges/ SKAT/ index.html) for rare variant association tests implemented under the case-control study design.Further, we used the genomeAD gene mutation constraint prediction tool (https:// gnomad.broad insti tute.org/ help/ const raint) to identify associated genes level of intolerance to mutational changes.

Gene expression analyses
To provide biological insight, expression of the associated genes during organogenesis of the human and mouse faces was examined.The human craniofacial gene expression dataset was generated from extensive bulk mRNA sequencing of human craniofacial tissue between 4 and 8 weeks post-conception.Since gene expressions are time and tissue-specific, the dataset included bulk mRNA sequencing at five time points corresponding to Carnegie stages of the embryonic period (CS13, CS14, CS15, CS17, and CS22), along with a culture model of cranial neural crest cells.This resulted in transcriptomic data in a time series covering nearly one-half of the embryonic period, during which critical events in craniofacial development occur.Additionally, the dataset included single-nuclei RNA-seq of craniofacial prominences of human CS17 embryos, corresponding to approximately 7 weeks postfertilization, a period that coincides with the later stage of lip formation and the early stage of palate formation.Additional details on the RNA seq data quality control and analysis was reported by Yankee et al. 28 .SysFACE analysis was performed as previously described 33 using GSE7759, GSE22989, GSE31004, and GSE11400 microarray data (Affymetrix Mouse Genome 430 2.0 Array) and GSE55965 microarray data (Affymetrix Mouse Gene 1.0 ST Array).The datasets were analyzed using affy package in R. Multiple probesets representing individual genes were normalized and the probeset with highest median expression was considered representative of gene expression.WB data generated on Affymetrix Mouse Genome 430 2.0 Array platform as previously described was used for enriched expression analysis.

Figure 1 .
Figure1.(a) Bubble plots of gene expression for associated genes in each of the major cell types from human craniofacial tissues of CS17 embryos.Size correlates to the percent of cells per cluster which express the gene.Color corresponds to the average expression of that gene within cells of that cluster (Low = blue, high = red).(b) Expression analysis in craniofacial structures during human face development.Heatmap of expressed genes across all samples in the developmental series.Z-scores for normalized gene counts ranging from high (yellow) to low (blue) expression.The dendrogram on the left is the hierarchical clustering of all samples.

Figure 2 .
Figure 2. SysFACE-based expression analysis of candidate genes in mouse facial development.Mouse orthologs for 10 of the 13 human candidate genes were examined using the SysFACE tool that is based on microarray gene expression data from isolated facial tissue in mouse development.Heat-map denotes the relative expression of individual genes at various stages of mouse embryonic (E) or postnatal (P) development in specific facial tissues, namely (A) Mandibular and maxillary columnar epithelium, (B) nasal eminence/prominence, and mandibular and maxillary arch, (C) frontonasal and mandible, (D) maxilla and palate.The intensity of the color in the heatmap (row-wise) is representative of the extent of candidate gene expression based on the average fluorescence signal intensity in the specific tissue.Note that for palate, there were independent datasets for E13.5 and E14.5 and these are denoted as E13.5a, E13.5b, etc. FaceBase datasets generated on Affymetrix Mouse Gene 1.0 ST Array microarray were meta-analyzed for (A,B) and Affymetrix Mouse Genome 430 2.0 Array microarray datasets were meta-analyzed for (C,D).

Figure 3 .
Figure 3. (a) Summarized flow chart for the analytical pipeline.(b) The process of sorting for variants within a gene of interest.The left of the image shows all variants identified within the gene after sequencing.The variants of interest on the right were predicted to result in a change in amino acid or affect splicing by in-silico predictive tools and were included in our analysis.Color codes; Red: Non-synonymous variants in protein-coding regions; Yellow: Synonymous variants (not selected for analysis); Blue: Splice-site variants.

Table 1 .
Gene-based results for the protein-altering and missense only models.Positive Z-scores indicate more constraint (fewer observed variants than expected), and negative scores indicate less constraint (more observed variants than expected) as observed in the genomeAD dataset.a Statistically significant missense constraint Z score.

Table 2 .
Distribution of the GWAS study participants by country of origin and cleft status.GWAS Genomewide association study, NSCL/P Nonsyndromic cleft lip with or without cleft palate, NSCPO Nonsyndromic cleft palate only.
missense only (MO), and (3) loss of function only (LOFO).The different phenotypes (NSCL/P and NSCPO) were analyzed independently and separately.NSCL/P was further subdivided into nonsyndromic cleft lip only (NSCLO) and nonsyndromic cleft lip and palate (NSCLP).Gene-based rare variant aggregate tests were used to identify genes enriched in rare variants associated with NSOFCs.Three rare variant aggregation tests were conducted: the combined multivariate and collapsing (CMC) test, the sequence kernel association test (SKAT), and the SKAT-O test.The first two tests are complementary, operating under different assumptions